In this project,I will be exploring the relationship between household income levels and food expenditure across different U.S. states. I would like to check if the situtation in U.S supports engel's law which explains that as a household's income increases, the percentage of that income spent on food decreases.
This data contains information on median incomes across states,I chose median instead of the mean because it reduces the influence of extremely wealthy people.
Data Source (This link will direct you to an article, which I collected data from a map there) This data contains state-level numerical information on the proportion of household spending allocated to food.
Furthuremore, I will use spatial data from "/home/jovyan/shared_data/data/geog407/lab3/" in week-3 class to help creating maps.
# Median household income data
import pandas as pd
df = pd.read_csv('median_income.csv')
# Data Cleaning:Rename columns
df_income =df.rename(columns={'Value (Dollars)': 'median_income',
'Rank within US (of 52 states)':'rank','State':'state'})
# Data Cleaning:Change the type of values in "median_income" to float
df_income['median_income'] = df_income['median_income'].str.replace(',', '').astype(float)
df_income.head()
# There are 50 rows and 4 columns
df_income.shape
# See if there is missing value, the result is good.
df_income.isna().sum()
# Data types of all columns look good
df_income.dtypes
# Simple visualization
%matplotlib inline
import matplotlib.pyplot as plt
# Boxplot for median_income is best for visualizing numerical data
# As you can see, Q2 lies in leftward
plt.boxplot(df_income['median_income'],vert=False)
plt.title("Median incomes across US states")
plt.xlabel("median income($)")
# Map visualization
import geopandas as gpd
import matplotlib.pyplot as plt
import os
from datetime import datetime
# Set environment variable needed for basemap
os.environ["PROJ_LIB"] = r"/opt/conda/pkgs/proj4-5.2.0-he1b5a44_1006/share/proj"
# Define data path
data_path = "/home/jovyan/shared_data/data/geog407/assignment3/"
# Load geographic and income data
state_geo = f"{data_path}/us-states.json"
gdf_states = gpd.read_file(state_geo)
# Merge shapefile and income dataframe
gdf_merged = gdf_states.merge(df_income, left_on="name", right_on="state")
# Exclude Alaska and Hawaii to aviod scaling issue.
gdf_contig = gdf_merged[~gdf_merged["name"].isin(["Alaska", "Hawaii"])]
# Plot the map
fig, ax = plt.subplots(figsize=(12, 8))
gdf_contig.plot(
column="median_income",
cmap="YlGnBu",
edgecolor="0.8",
linewidth=0.5,
legend=True,
legend_kwds={
"label": "Median Income ($)",
"orientation": "horizontal"
},
ax=ax
)
# Add title
ax.set_title("Household Median Income by State (Mainland U.S.)", fontsize=16)
ax.axis("off")
# Show the plot
plt.show()
# Filter for Alaska only
gdf_alaska = gdf_merged[gdf_merged["name"] == "Alaska"]
# Create figure and axis
fig, ax = plt.subplots(figsize=(8, 8))
# Plot Alaska's median income
gdf_alaska.plot(
column="median_income",
cmap="YlGnBu",
edgecolor="0.8",
linewidth=0.5,
legend=True,
legend_kwds={
"label": "Median Income ($)",
"orientation": "horizontal"
},
ax=ax
)
# Add title and remove axis
ax.set_title("Household Median Income in Alaska", fontsize=16)
ax.axis("off")
# Show the map
plt.show()
# Filter for Hawaii only
gdf_hawaii = gdf_merged[gdf_merged["name"] == "Hawaii"]
# Create figure and axis
fig, ax = plt.subplots(figsize=(8, 8))
# Plot Hawaii's median income
gdf_hawaii.plot(
column="median_income",
cmap="YlGnBu",
edgecolor="0.8",
linewidth=0.5,
legend=True,
legend_kwds={
"label": "Median Income ($)",
"orientation": "horizontal"
},
ax=ax
)
# Add title and remove axis
ax.set_title("Household Median Income in Hawaii", fontsize=16)
ax.axis("off")
# Show the map
plt.show()
# Create interactive map using folium
import pandas as pd
import folium
import json
# Define data path
data_path = "/home/jovyan/shared_data/data/geog407/assignment3/"
state_geo = f"{data_path}/us-states.json"
# Load GeoJSON
with open(state_geo) as f:
geo_data = json.load(f)
# Add median_income into the GeoJSON to display it in popups
for feature in geo_data["features"]:
state_name = feature["properties"]["name"]
match = df_income[df_income["state"] == state_name]
if not match.empty:
# Get the first (and only) value from the match
feature["properties"]["median_income"] = int(match["median_income"].values[0])
else:
feature["properties"]["median_income"] = None
# Make the base map
m = folium.Map(location=[48, -102], zoom_start=3)
# Add Choropleth layer
choropleth = folium.Choropleth(
geo_data=geo_data,
name="choropleth",
data=df_income,
columns=["state", "median_income"],
key_on="feature.properties.name",
fill_color="YlGnBu",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Median Income ($)",
).add_to(m)
# Hover tooltip to see state names
folium.GeoJsonTooltip(
fields=["name"],
aliases=["State:"],
sticky=True
).add_to(choropleth.geojson)
# Click popup to show both name and income
folium.GeoJsonPopup(
fields=["name", "median_income"],
aliases=["State:", "Median Income:"],
localize=True
).add_to(choropleth.geojson)
# Add layer control
folium.LayerControl().add_to(m)
# Display map in notebook
m
# Save map as HTML
#m.save("Median_income_map.html")
# Display my folium map
from IPython.display import Image
Image("Median_income_map.png")
# Food Expenditure share data
df_share = pd.read_csv('states_food_share.csv')
df_share.head()
# There are 50 rows and 2 columns
df_share.shape
# See if there is missing value, the result is good.
df_share.isna().sum()
# Data types of all columns look good
df_share.dtypes
# Simple visulization
plt.boxplot(df_share['average_percentage_of_income_on_food'],vert=False)
plt.title("average_percentage_of_income_on_food across US states")
plt.xlabel("Percentage")
# Mainland U.S
data_path = "/home/jovyan/shared_data/data/geog407/assignment3/"
state_geo = f"{data_path}/us-states.json"
gdf_states = gpd.read_file(state_geo)
# Merge shapefile and income dataframe
gdf_merged = gdf_states.merge(df_share, left_on="name", right_on="state")
# Removed Alaska and Hawaii to aviod scaling issue
gdf_contig = gdf_merged[~gdf_merged["name"].isin(["Alaska", "Hawaii"])]
fig, ax = plt.subplots(figsize=(12, 8))
# Plot the map
gdf_contig.plot(column="average_percentage_of_income_on_food",cmap="YlGnBu",
edgecolor="0.8",
linewidth=0.5,
legend=True,
legend_kwds={
"label": "Average % of Income Spent on Food",
"orientation": "horizontal"
},
ax=ax)
# Set title
ax.set_title("Average % of Income Spent on Food by State (Mainland U.S.)", fontsize=16)
ax.axis("off")
# Show the map
plt.show()
#State of Alaska
gdf_alaska = gdf_merged[gdf_merged["name"] == "Alaska"]
fig, ax = plt.subplots(figsize=(8, 8))
gdf_alaska.plot(
column="average_percentage_of_income_on_food",
cmap="YlGnBu",
edgecolor="0.8",
linewidth=0.5,
legend=True,
legend_kwds={"label": "Average % of Income Spent on Food", "orientation":"horizontal"},
ax=ax)
ax.set_title("Average % of Income Spent on Food in Alaska", fontsize=16)
ax.axis("off")
plt.show()
#State of Hawaii
gdf_alaska = gdf_merged[gdf_merged["name"] == "Hawaii"]
fig, ax = plt.subplots(figsize=(8, 8))
gdf_alaska.plot(
column="average_percentage_of_income_on_food",
cmap="YlGnBu",
edgecolor="0.8",
linewidth=0.5,
legend=True,
legend_kwds={"label": "Average % of Income Spent on Food", "orientation":"horizontal"},
ax=ax)
ax.set_title("Average % of Income Spent on Food in Hawaii", fontsize=16)
ax.axis("off")
plt.show()
import pandas as pd
import folium
import json
data_path = "/home/jovyan/shared_data/data/geog407/assignment3/"
state_geo = f"{data_path}/us-states.json"
# Load GeoJSON
with open(state_geo) as f:
geo_data = json.load(f)
# Add average_percentage_of_income_on_food into GeoJSON properties
for feature in geo_data["features"]:
state_name = feature["properties"]["name"]
match = df_share[df_share["state"] == state_name]
if not match.empty:
feature["properties"]["average_percentage_of_income_on_food"] = float(
match["average_percentage_of_income_on_food"].values[0])
m = folium.Map(location=[48, -102], zoom_start=3)
# Plot the map
choropleth = folium.Choropleth(
geo_data=geo_data,
name="choropleth",
data=df_share,
columns=["state", "average_percentage_of_income_on_food"],
key_on="feature.properties.name",
fill_color="YlGnBu",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Average percentage of Income Spent on Food",).add_to(m)
# Click popup to show both name and food share percentage
folium.GeoJsonTooltip(
fields=["name"],
aliases=["State:"],
sticky=True).add_to(choropleth.geojson)
folium.GeoJsonPopup(
fields=["name", "average_percentage_of_income_on_food"],
aliases=["State:", "Avg % Income on Food:"],
localize=True).add_to(choropleth.geojson)
# Layer control
folium.LayerControl().add_to(m)
m
#m.save("Spending_map.html")
#Display my folium map
Image("Spending_map.png")
From the map visualizations, a consistent pattern emerges: states with higher incomes tend to spend a smaller percentage of their income on food. For example, in California—a high-income state—the average percentage of income spent on food is roughly 12%, whereas in Alabama—a lower-income state—it is closer to 17%. This general trend supports Engel’s Law, which states that as a household’s income increases, the percentage of that income spent on food decreases.
I used k-mean clustering to give us more insight about the trend.
# Merge two datasets together
df_merged = df_income.merge(df_share, on='state', how='inner')
df_merged.head()
from sklearn.cluster import KMeans
# Define the run_kmeans function
def run_kmeans(dataset, max_iterations=100, num_clusters=5, num_seeds=10):
X = dataset[['median_income', 'average_percentage_of_income_on_food']]
km = KMeans(
n_clusters=num_clusters,
init='k-means++',
max_iter=max_iterations,
n_init=num_seeds,
random_state=42
)
clusters = km.fit_predict(X)
dataset['cluster_id'] = clusters
return dataset, km
# Run K-Means
clustered_income, km_model = run_kmeans(df_merged, num_clusters=5)
# Display results
clustered_income
K-means clustering helped me identify similarities between states. The results show a negative trend in which higher-income states tend to spend a smaller percentage of their income on food. This result supports the conclusion I previously made from the map visualizations.
# Visualize the clusters
plt.figure(figsize=(8, 6))
# Scatter plot of the two features, colored by cluster_id
plt.scatter(
clustered_income['median_income'],
clustered_income['average_percentage_of_income_on_food'],
c=clustered_income['cluster_id'],
cmap='tab10',
s=50,
alpha=0.8
)
# plot cluster centers
centers = km_model.cluster_centers_
plt.scatter(
centers[:, 0],
centers[:, 1],
c='black',
s=200,
marker='X',
label='Centroids'
)
plt.title("K-Means Clustering of Income vs Food Spending")
plt.xlabel("Median Income")
plt.ylabel("Average % of Income on Food")
plt.legend()
# Use plt.grid for a more organized graph
plt.grid(True)
plt.show()
I also did more data analysis by fitting a model, our primary goal is to explore the relationship between income and the proportion of income spent on food. Even with a R^2 of 0.678, the model clearly shows the direction of the relationship, which is the exact insight we are looking for.
#Import libraries
import seaborn as sns
sns.set()
import numpy as np
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf
df_new = df_merged[['median_income','average_percentage_of_income_on_food']]
df_new=df_new.rename(columns={'median_income': 'income','average_percentage_of_income_on_food':'income_on_food'})
sns.pairplot(df_new)
lm = smf.ols(formula='income_on_food~income',data=df_new).fit()
lm
Overall, the residual plot looks good, showing no major violations of the linearity assumptions. There are, however, a few points that are possible outliers, but overall the model appears satisfactory.
# Residuals vs. Fitted Values Plot
# This plot shows the distribution of residuals.
# If the points are evenly spread around the red dashed line
# it suggests that the linearity assumption of the model meets.
# plot the dots
plt.scatter(lm.fittedvalues, lm.resid)
# Draw the red dashed line
plt.axhline(y=0,color='r',linestyle='--')
plt.xlabel('Fitted values vs. Residuals plot')
plt.ylabel('Residuals')
plt.title('fitted value VS Residual plots')
# Fit a model
chosen_lm = smf.ols(formula='income_on_food~income',data=df_new).fit()
chosen_lm.summary()
For data with 50 rows((n=50)) and one independent variable(Income), degrees of freedom is 48 (n-p=50-2=48). t-value for 48 degrees of freedom: 2.011
# 95% estimation interval for income variable
upper_bound = (-1.407e-06)+(2.011*(1.4e-07))
lower_bound = (-1.407e-06)-(2.011*(1.4e-07))
print(upper_bound)
print(lower_bound)
[-1.68854e-06, -1.12546e-06]
The income coefficint is always below 0, indicating the forever negative relationship between income and income_on_food, this support the thoery of engle's law.
Conclusion: As you can see, the estimation interval is below 0, indicating the forever negative relationship between income and income_on_food, This additional data analysis supports the conclusion I made from the map visualizations and further reinforces Engel’s Law.